Abstract



In this paper, we first demonstrate that b-bit minwise hashing, whose estimators are positive definite kernels, can be naturally integrated with learning algorithms such as SVM and logistic regression. We adopt a simple scheme to transform the nonlinear (resemblance) kernel into a linear (inner product) kernel; hence large-scale problems can be solved extremely efficiently. Our method provides a simple, effective solution to large-scale learning on massive and extremely high-dimensional datasets, especially when the data do not fit in memory.

We then compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm (which is related to the Count-Min (CM) sketch). Interestingly, VW has the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is usually significantly more accurate (at the same storage) than VW (and random projections) on binary data. Furthermore, b-bit minwise hashing can be combined with VW to achieve further improvements in training speed, especially when b is large.



1 Introduction

With the advent of the Internet, many machine learning applications are faced with very large and inherently high-dimensional datasets, resulting in challenges in scaling up training algorithms and storing the data. Especially in the context of search and machine translation, corpus sizes used in industrial practice have long exceeded the main memory capacity of a single machine. For example, [34] experimented with a dataset with potentially 16 trillion ($1.6 \times 10^{13}$) unique features. [32] discusses training sets with (on average) $10^{11}$ items and $10^{9}$ distinct features, requiring novel algorithmic approaches and architectures. As a consequence, there has been a renewed emphasis on scaling up machine learning techniques by using massively parallel architectures; however, methods relying solely on parallelism can be expensive (with regard to both hardware requirements and energy costs) and often induce significant additional communication and data distribution overhead.

This work approaches the challenges posed by large datasets by leveraging techniques from the area of similarity search [2], where a similar increase in dataset sizes has made the storage and computational requirements for computing exact distances prohibitive, thus making data representations that allow compact storage and efficient approximate distance computation necessary.

The method of b-bit minwise hashing [26, 27, 25] is a recent technique for efficiently (in both time and space) computing resemblances among extremely high-dimensional (e.g., $2^{64}$) binary vectors. In this paper, we show that b-bit minwise hashing can be seamlessly integrated with linear Support Vector Machine (SVM) [22, 30, 14, 19, 35] and logistic regression solvers. In [35], the authors addressed the critically important problem of training linear SVM when the data cannot fit in memory. Our work addresses the same problem using a very different approach.



1.1 Ultra High-Dimensional Large Datasets and Memory Bottleneck 

In the context of search, a standard procedure to represent documents (e.g., Web pages) is to use w-shingles (i.e., w contiguous words), where w can be as large as 5 (or 7) in several studies [6, 7, 15]. This procedure can generate datasets of extremely high dimensions. For example, suppose we only consider $10^5$ common English words. Using w = 5 may require the size of the dictionary Ω to be $D = (10^5)^5 = 10^{25} \approx 2^{83}$. In practice, $D = 2^{64}$ often suffices, as the number of available documents may not be large enough to exhaust the dictionary. For w-shingle data, normally only absence/presence (0/1) information is used, as it is known that word frequency distributions within documents approximately follow a power law [3], meaning that most single terms occur rarely, thereby making a w-shingle unlikely to occur more than once in a document. Interestingly, even when the data are not too high-dimensional, empirical studies [9, 18, 20] achieved good performance with binary-quantized data.

When the data can fit in memory, linear SVM is often extremely efficient after the data are loaded into memory. It is however often the case that, for very large datasets, the data loading time dominates the computing time for training the SVM [35]. A much more severe problem arises when the data cannot fit in memory. This situation can be very common in practice. The publicly available webspam dataset needs about 24GB of disk space (in LIBSVM input data format), which exceeds the memory capacity of many desktop PCs. Note that webspam, which contains only 350,000 documents represented by 3-shingles, is still a small dataset compared to industry applications [32].

1.2 A Brief Introduction of Our Proposal 

We propose a solution which leverages b-bit minwise hashing. Our approach assumes the data vectors are binary, very high-dimensional, and relatively sparse, which is generally true of text documents represented via shingles. We apply b-bit minwise hashing to obtain a compact representation of the original data. In order to use the technique for efficient learning, we have to address several issues:

• We need to prove that the matrices generated by b-bit minwise hashing are indeed positive definite, which will provide the solid foundation for our proposed solution.

• If we use b-bit minwise hashing to estimate the resemblance, which is nonlinear, how can we effectively convert this nonlinear problem into a linear problem?

• Compared to other hashing techniques such as random projections, the Count-Min (CM) sketch [12], or Vowpal Wabbit (VW) [34], does our approach exhibit advantages?

It turns out that our proof in the next section that b-bit hashing matrices are positive definite naturally provides the construction for converting the otherwise nonlinear SVM problem into linear SVM.

[35] proposed solving the memory bottleneck by partitioning the data into blocks, which are repeatedly loaded into memory as their approach updates the model coefficients. However, the computational bottleneck is still at the memory because loading the data blocks for many iterations consumes a large number of disk I/Os. One should note that our method is not really a competitor of the approach in [35]. In fact, both approaches may work together to solve extremely large problems.

2 Review Minwise Hashing and b-Bit Minwise Hashing 

Minwise hashing |I6] |2l has been successfully applied to a very wide range of real-world problems especially in the 
context of search ill H g] [l6l [lOl [11 [33l El] [O] [II] [HI for efficiently computing set similarities. 

Minwise hashing mainly works well with binary data, which can be viewed either as 0/1 vectors or as sets. Given two sets, $S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$, a widely used (normalized) measure of similarity is the resemblance R:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \quad \text{where } f_1 = |S_1|, \; f_2 = |S_2|, \; a = |S_1 \cap S_2|.$$

In this method, one applies a random permutation $\pi: \Omega \to \Omega$ on $S_1$ and $S_2$. The collision probability is simply

$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R. \tag{1}$$

One can repeat the permutation k times, $\pi_1, \pi_2, \ldots, \pi_k$, to estimate R without bias, as

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \tag{2}$$

$$\mathrm{Var}\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R). \tag{3}$$
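The estimator in (2) can be illustrated with a short sketch. The code below is only an illustration under simplifying assumptions: it draws explicit random permutations of a small universe with Python's standard `random` module (practical systems would simulate permutations with hash functions), and the function names `resemblance` and `minwise_estimate` are hypothetical, not from the paper.

```python
import random

def resemblance(s1, s2):
    """Exact resemblance R = |S1 ∩ S2| / |S1 ∪ S2| of two non-empty sets."""
    return len(s1 & s2) / len(s1 | s2)

def minwise_estimate(s1, s2, D, k, seed=0):
    """Estimate R from k random permutations of {0, ..., D-1}, following (2)."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)                 # perm[x] is the image of x under pi_j
        z1 = min(perm[x] for x in s1)     # min(pi_j(S1))
        z2 = min(perm[x] for x in s2)     # min(pi_j(S2))
        matches += (z1 == z2)             # indicator of a collision
    return matches / k

if __name__ == "__main__":
    s1, s2 = {1, 2, 3, 4}, {3, 4, 5, 6}
    print("exact R:", resemblance(s1, s2))
    print("estimate:", minwise_estimate(s1, s2, D=20, k=500))
```

Identical sets always collide and disjoint sets never do, because a permutation is a bijection; for other pairs the estimate fluctuates around R with variance given by (3).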

The common practice of minwise hashing is to store each hashed value, e.g., $\min(\pi(S_1))$ and $\min(\pi(S_2))$, using 64 bits [15]. The storage (and computational) cost will be prohibitive in truly large-scale (industry) applications [28].

b-bit minwise hashing [26, 27, 25] provides a strikingly simple solution to this (storage and computational) problem by storing only the lowest b bits (instead of 64 bits) of each hashed value. For convenience, we define the minimum values under $\pi$: $z_1 = \min(\pi(S_1))$ and $z_2 = \min(\pi(S_2))$, and define $e_{1,i}$ = the i-th lowest bit of $z_1$, and $e_{2,i}$ = the i-th lowest bit of $z_2$.



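Theorem 1 [26] Assume D is large. Then

$$P_b = \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right) = C_{1,b} + (1 - C_{2,b})\, R, \tag{4}$$

where

$$r_1 = \frac{f_1}{D}, \quad r_2 = \frac{f_2}{D}, \quad f_1 = |S_1|, \quad f_2 = |S_2|,$$

$$C_{1,b} = A_{1,b}\frac{r_2}{r_1 + r_2} + A_{2,b}\frac{r_1}{r_1 + r_2}, \qquad C_{2,b} = A_{1,b}\frac{r_1}{r_1 + r_2} + A_{2,b}\frac{r_2}{r_1 + r_2},$$

$$A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \qquad A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}.$$

This (approximate) formula (4) is remarkably accurate, even for very small D. Some numerical comparisons with the exact probabilities are provided in Appendix A.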

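We can then estimate $P_b$ (and R) from k independent permutations $\pi_1, \pi_2, \ldots, \pi_k$:

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \qquad \hat{P}_b = \frac{1}{k} \sum_{j=1}^{k} \left\{ \prod_{i=1}^{b} 1\{e_{1,i,\pi_j} = e_{2,i,\pi_j}\} \right\}, \tag{5}$$

$$\mathrm{Var}\left(\hat{R}_b\right) = \frac{\mathrm{Var}\left(\hat{P}_b\right)}{[1 - C_{2,b}]^2} = \frac{1}{k} \frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2}. \tag{6}$$

We will show that we can apply b-bit hashing for learning without explicitly estimating R from (4).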
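As a numerical illustration of Theorem 1 and the estimator built on it, the following sketch computes the constants $C_{1,b}$ and $C_{2,b}$ and converts between R and $P_b$. It assumes the standard form of the constants from [26] ($A_{j,b} = r_j[1-r_j]^{2^b-1} / (1-[1-r_j]^{2^b})$ with $r_j = f_j/D$); all function names are hypothetical.

```python
def bbit_constants(f1, f2, D, b):
    """Return (C1, C2) of Theorem 1 for set sizes f1, f2 and universe size D."""
    r1, r2 = f1 / D, f2 / D

    def A(r):
        # A_{j,b} = r [1 - r]^(2^b - 1) / (1 - [1 - r]^(2^b))
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

    A1, A2 = A(r1), A(r2)
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1, C2

def collision_prob(R, C1, C2):
    """P_b = C1 + (1 - C2) R, as in (4)."""
    return C1 + (1 - C2) * R

def estimate_R(P_hat, C1, C2):
    """Invert (4): R_hat = (P_hat - C1) / (1 - C2)."""
    return (P_hat - C1) / (1 - C2)

def var_R(R, C1, C2, k):
    """Variance of R_hat from k permutations: P_b (1 - P_b) / (k [1 - C2]^2)."""
    Pb = collision_prob(R, C1, C2)
    return Pb * (1 - Pb) / (k * (1 - C2) ** 2)

if __name__ == "__main__":
    C1, C2 = bbit_constants(f1=100, f2=100, D=2 ** 30, b=2)
    print(C1, C2)                      # both close to 1 / 2^b for sparse data
    print(var_R(0.5, C1, C2, k=200))
```

For very sparse data ($r_1, r_2 \to 0$) both constants approach $1/2^b$, which is the probability that the lowest b bits of two different hashed values agree by chance.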



3 Kernels from Minwise Hashing and b-Bit Minwise Hashing

This section proves some theoretical properties of matrices generated by resemblance, minwise hashing, or b-bit minwise hashing, which are all positive definite matrices. Our proof not only provides a solid theoretical foundation for using b-bit hashing in learning, but also illustrates our idea behind the construction for integrating b-bit hashing with linear learning algorithms.

Definition: A symmetric $n \times n$ matrix K satisfying $\sum_{i,j} c_i c_j K_{ij} \geq 0$ for all real vectors c is called positive definite (PD). Note that here we do not differentiate PD from nonnegative definite.
Theorem 2 Consider n sets $S_1, S_2, \ldots, S_n \subseteq \Omega = \{0, 1, \ldots, D-1\}$. Apply one permutation $\pi$ to each set and define $z_i = \min\{\pi(S_i)\}$. The following three matrices are all PD.

1. The resemblance matrix $R \in \mathbb{R}^{n \times n}$, whose (i,j)-th entry is the resemblance between set $S_i$ and set $S_j$: $R_{ij} = \frac{|S_i \cap S_j|}{|S_i \cup S_j|} = \frac{|S_i \cap S_j|}{|S_i| + |S_j| - |S_i \cap S_j|}$.

2. The minwise hashing matrix $M \in \mathbb{R}^{n \times n}$: $M_{ij} = 1\{z_i = z_j\}$.

3. The b-bit minwise hashing matrix $M^{(b)} \in \mathbb{R}^{n \times n}$: $M^{(b)}_{ij} = \prod_{t=1}^{b} 1\{e_{i,t} = e_{j,t}\}$, where $e_{i,t}$ is the t-th lowest bit of $z_i$.

Consequently, consider k independent permutations and denote by $M^{(b)}_{(s)}$ the b-bit minwise hashing matrix generated by the s-th permutation. Then the summation $\sum_{s=1}^{k} M^{(b)}_{(s)}$ is also PD.

Proof: A matrix A is PD if it can be written as an inner product $B^T B$. Because

$$M_{ij} = 1\{z_i = z_j\} = \sum_{t=0}^{D-1} 1\{z_i = t\} \times 1\{z_j = t\}, \tag{7}$$

$M_{ij}$ is the inner product of two D-dim vectors. Thus, M is PD.

Similarly, the b-bit minwise hashing matrix $M^{(b)}$ is PD because

$$M^{(b)}_{ij} = \sum_{t=0}^{2^b - 1} 1\{z_i = t\} \times 1\{z_j = t\}, \tag{8}$$

where $z_i$ and $z_j$ are here understood as the values stored in the lowest b bits.

The resemblance matrix R is PD because $R_{ij} = \Pr\{M_{ij} = 1\} = E(M_{ij})$ and $M_{ij}$ is the (i,j)-th element of the PD matrix M. Note that the expectation is a linear operation. □
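The inner-product argument in the proof can be checked numerically on a toy example. The sketch below (hypothetical helper names, pure Python) builds the one-hot vectors used in (7), forms the Gram matrix, and evaluates the quadratic form $\sum_{i,j} c_i c_j M_{ij}$, which is a sum of squares and therefore nonnegative.

```python
def one_hot(z, dim):
    """Indicator vector (1{z = 0}, ..., 1{z = dim - 1})."""
    v = [0] * dim
    v[z] = 1
    return v

def gram_from_hashes(zs, dim):
    """Matrix with entries M_ij = <one_hot(z_i), one_hot(z_j)> = 1{z_i = z_j}."""
    B = [one_hot(z, dim) for z in zs]
    return [[sum(p * q for p, q in zip(bi, bj)) for bj in B] for bi in B]

def quad_form(K, c):
    """Quadratic form sum_{i,j} c_i c_j K_ij."""
    n = len(K)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))

if __name__ == "__main__":
    zs = [3, 1, 3, 0]                  # toy hashed values with b = 2
    M = gram_from_hashes(zs, dim=4)
    print(M)
    print(quad_form(M, [1, -2, 0.5, 3]))
```

Grouping the coefficients by hashed value shows why the quadratic form cannot be negative: it equals $\sum_t (\sum_{i: z_i = t} c_i)^2$.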

Our proof that the b-bit minwise hashing matrix $M^{(b)}$ is PD provides us with a simple strategy to expand a nonlinear (resemblance) kernel into a linear (inner product) kernel. After concatenating the k vectors resulting from (8), the new (binary) data vector after the expansion will be of dimension $2^b \times k$ with exactly k ones.



4 Integrating b-Bit Minwise Hashing with (Linear) Learning Algorithms

Linear algorithms such as linear SVM and logistic regression have become very powerful and extremely popular. Representative software packages include SVM$^{\text{perf}}$ [22], Pegasos [30], Bottou's SGD SVM [5], and LIBLINEAR [14].

Given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, the L2-regularized linear SVM solves the following optimization problem:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left\{1 - y_i w^T x_i, \; 0\right\}, \tag{9}$$

and the L2-regularized logistic regression solves a similar problem:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right). \tag{10}$$

Here C > 0 is an important penalty parameter. Since our purpose is to demonstrate the effectiveness of our proposed scheme using b-bit hashing, we simply provide results for a wide range of C values and assume that the best performance is achievable if we conduct cross-validations.
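To make the two objectives concrete, the following sketch evaluates (9) and (10) for a given weight vector. It is not a solver; it only computes the objective values on dense Python lists, and the function names are hypothetical.

```python
import math

def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def svm_objective(w, X, y, C):
    """L2-regularized linear SVM objective (9): 0.5 w'w + C * sum of hinge losses."""
    hinge = sum(max(1 - yi * dot(w, xi), 0) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge

def logistic_objective(w, X, y, C):
    """L2-regularized logistic regression objective (10)."""
    loss = sum(math.log1p(math.exp(-yi * dot(w, xi))) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * loss

if __name__ == "__main__":
    X = [[1, 0], [0, 1]]
    y = [1, -1]
    print(svm_objective([0, 0], X, y, C=2))       # only the hinge term contributes
    print(logistic_objective([0, 0], X, y, C=2))  # 2 * 2 * log(2)
```

In the proposed scheme, each row of `X` would be the expanded binary vector of length $2^b \times k$, which a sparse solver such as LIBLINEAR stores as k index entries rather than as a dense list.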

In our approach, we apply k independent random permutations on each feature vector $x_i$ and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely nbk bits. At run-time, we expand each new data point into a vector of length $2^b \times k$.
For example, suppose k = 3 and the hashed values are originally {12013, 25964, 20191}, whose binary digits are {010111011101101, 110010101101100, 100111011011111}. Consider b = 2. Then the binary digits are stored as {01, 00, 11} (which corresponds to {1, 0, 3} in decimals). At run-time, we need to expand them into a vector of length $2^b k = 12$, to be {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}, which will be the new feature vector fed to a solver:

Original hashed values (k = 3):        12013             25964             20191
Original binary representations:       010111011101101   110010101101100   100111011011111
Lowest b = 2 binary digits:            01                00                11
Expanded 2^b = 4 binary digits:        0010              0001              1000
New feature vector fed to a solver:    {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}

Clearly, this expansion is directly inspired by the proof that the b-bit minwise hashing matrix is PD in Theorem 2. Note that the total storage cost is still just nbk bits and each new data vector (of length $2^b \times k$) has exactly k 1's. Also, note that in this procedure we actually do not explicitly estimate the resemblance R using (5).
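The expansion in the example can be written in a few lines. In this sketch the position of the single 1 inside each block of $2^b$ entries follows the ordering used in the example (value 1 maps to 0010, value 0 to 0001, value 3 to 1000); any fixed ordering would give the same inner products. The function names are hypothetical.

```python
def lowest_bits(hashed_values, b):
    """Keep only the lowest b bits of each hashed value."""
    mask = (1 << b) - 1
    return [h & mask for h in hashed_values]

def expand(hashed_values, b):
    """Expand k hashed values into a binary vector of length 2^b * k with k ones."""
    width = 1 << b
    vec = [0] * (width * len(hashed_values))
    for j, low in enumerate(lowest_bits(hashed_values, b)):
        # Block j occupies positions [j * width, (j + 1) * width);
        # the stored value `low` selects one position inside the block.
        vec[j * width + (width - 1 - low)] = 1
    return vec

if __name__ == "__main__":
    hashed = [12013, 25964, 20191]
    print(lowest_bits(hashed, 2))   # [1, 0, 3]
    print(expand(hashed, 2))        # [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

The inner product of two expanded vectors counts the permutations on which the stored b-bit values agree, which is k times the estimate $\hat{P}_b$.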



5 Experimental Results on Webspam Dataset 

Our experimental settings follow the work in [35] very closely. The authors of [35] conducted experiments on three datasets, of which the webspam dataset is public and reasonably high-dimensional (n = 350000, D = 16609143). Therefore, our experiments focus on webspam. Following [35], we randomly selected 20% of the samples for testing and used the remaining 80% for training.

We chose LIBLINEAR as the tool to demonstrate the effectiveness of our algorithm. All experiments were conducted on workstations with a Xeon(R) CPU (W5590@3.33GHz) and 48GB RAM, under Windows 7. Thus, in our case, the original data (about 24GB in LIBSVM format) fit in memory. In applications for which the data do not fit in memory, we expect that b-bit hashing will be even more advantageous, because the hashed data are relatively very small. In fact, our experimental results will show that for this dataset, using k = 200 and b = 8 can achieve the same testing accuracy as using the original data. The effective storage for the reduced dataset (with 350K examples, using k = 200 and b = 8) would be merely about 70MB.
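The 70MB figure follows directly from the nbk-bit storage cost. A one-line check (the function name is hypothetical, and 1MB is taken as $10^6$ bytes):

```python
def hashed_storage_bytes(n, k, b):
    """Storage of n examples, each with k hashed values of b bits, in bytes."""
    return n * k * b // 8

if __name__ == "__main__":
    size = hashed_storage_bytes(n=350_000, k=200, b=8)
    print(size / 1e6, "MB")   # 350000 * 200 * 8 bits = 70,000,000 bytes
```

Compared with roughly 24GB for the original data in LIBSVM format, this is a reduction of more than 300-fold.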

5.1 Experimental Results on Nonlinear (Kernel) SVM 

We implemented a new resemblance kernel function and tried to use LIBSVM to train on the webspam dataset. We waited for over one week (footnote 1), but LIBSVM still had not output any results. Fortunately, using b-bit minwise hashing to estimate the resemblance kernels, we were able to obtain some results. For example, with C = 1 and b = 8, the training time of LIBSVM ranged from 1938 seconds (k = 30) to 13253 seconds (k = 500). In particular, when $k \geq 200$, the test accuracies essentially matched the best test results given by LIBLINEAR on the original webspam data.

Therefore, there is a significant benefit of data reduction provided by b-bit minwise hashing for training nonlinear SVM. This experiment also demonstrates that it is very important (and fortunate) that we are able to transform this nonlinear problem into a linear problem.

5.2 Experimental Results on Linear SVM 

Since there is an important tuning parameter C in linear SVM and logistic regression, we conducted extensive experiments for a wide range of C values (from $10^{-3}$ to $10^{2}$) with fine spacings in [0.1, 10].

We mainly experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16. Figures 1 (average) and 2 (std, standard deviation) provide the test accuracies. Figure 1 demonstrates that using $b \geq 8$ and $k \geq 150$ achieves about the same test accuracies as using the original data. Since our method is randomized, we repeated every experiment 50 times. We report both the mean and std values. Figure 2 illustrates that the stds are very small, especially with $b \geq 4$. In other words, our algorithm produces stable predictions. For this dataset, the best performances were usually achieved when $C \geq 1$.



Footnote 1 (Section 5.1): We will let the program run unless it is accidentally terminated (e.g., due to power outage).
Figure 1: Linear SVM test accuracy (averaged over 50 repetitions). With $k \geq 100$ and $b \geq 8$, b-bit hashing (solid) achieves very similar accuracies as using the original data (dashed, red if color is available). Note that after $k \geq 150$, the curves for b = 16 overlap the curves for b = 8.




Compared with the original training time (about 100 seconds), we can see from Figure 3 that our method only needs about 3 to 7 seconds near C = 1 (about 3 seconds for b = 8). Note that here the training time did not include the data loading time. Loading the original data took about 12 minutes while loading the hashed data took only about 10 seconds. Of course, there is a cost for processing (hashing) the data, which we find is efficient, confirming prior studies [6]. In fact, data processing can be conducted during data collection, as is the standard practice in search. In other words, prior to conducting the learning procedure, the data may be already processed and stored by (b-bit) minwise hashing, which can be used for multiple tasks including learning, clustering, duplicate detection, near-neighbor search, etc.

Compared with the original testing time (about 100 to 200 seconds), we can see from Figure 4 that the testing time of our method is merely about 1 or 2 seconds. Note that the testing time includes both the data loading time and computing time, as designed by LIBLINEAR. The efficiency of testing may be very important in practice, for example, when the classifier is deployed in a user-facing application (such as search), while the cost of training or pre-processing (such as hashing) may be less critical and can often be conducted off-line.
Figure 3: Linear SVM training time. Compared with the training time of using the original data (dashed, red if color is available), we can see that our method with b-bit hashing only needs a very small fraction of the original cost.



[Figure 4 plot omitted. Panels: linear SVM testing time on the webspam data ("svm: Spam: Testing Time") for k = 30, 50, 100, 150, 200, 300, 400, and 500; horizontal axis: C.]

Figure 4: Linear SVM testing time. The original costs are plotted using dashed (red, if color is available) curves. 



5.3 Experimental Results on Logistic Regression 

Figure 5 presents the test accuracies and Figure 7 presents the training time using logistic regression. Again, with $k \geq 150$ (or even $k \geq 100$) and $b \geq 8$, b-bit minwise hashing can achieve the same test accuracies as using the original data. Figure 6 presents the standard deviations, which again verify that our algorithm produces stable predictions for logistic regression.

From Figure 7, we can see that the training time is substantially reduced, from about 1000 seconds to only about 30 to 50 seconds (unless b = 16 and k is large).

In summary, it appears that b-bit hashing is highly effective in reducing the data size and speeding up the training (and testing), for both (nonlinear and linear) SVM and logistic regression. We notice that when using b = 16, the training time can be much larger than when using $b \leq 8$. Interestingly, we find that b-bit hashing can be combined with Vowpal Wabbit (VW) [34] to further reduce the training time, especially when b is large.
Figure 5: Logistic regression test accuracy. The dashed (red if color is available) curves represent the results using the original data.

Figure 7: Logistic regression training time. The dashed (red if color is available) curves represent the results using the original data.

6 Random Projections and Vowpal Wabbit (VW)



The two methods, random projections [1, 24] and Vowpal Wabbit (VW) [34, 31], are not limited to binary data (although for ultra high-dimensional data used in the context of search, the data are often binary). The VW algorithm is also related to the Count-Min sketch [12]. In this paper, we use "VW" particularly for the algorithm in [34].

For convenience, we denote two D-dim data vectors by $u_1, u_2 \in \mathbb{R}^D$. Again, the task is to estimate the inner product $a = \sum_{i=1}^{D} u_{1,i} u_{2,i}$.



6.1 Random Projections 

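The general idea is to multiply the data vectors, e.g., $u_1$ and $u_2$, by a random matrix $\{r_{ij}\} \in \mathbb{R}^{D \times k}$, where $r_{ij}$ is sampled i.i.d. from the following generic distribution [24]:

$$E(r_{ij}) = 0, \quad \mathrm{Var}(r_{ij}) = 1, \quad E(r_{ij}^3) = 0, \quad E(r_{ij}^4) = s, \quad s \geq 1. \tag{11}$$

We must have $s \geq 1$ because $\mathrm{Var}(r_{ij}^2) = E(r_{ij}^4) - E^2(r_{ij}^2) = s - 1 \geq 0$.

This generates two k-dim vectors, $v_1$ and $v_2$:

$$v_{1,j} = \sum_{i=1}^{D} u_{1,i} r_{ij}, \qquad v_{2,j} = \sum_{i=1}^{D} u_{2,i} r_{ij}, \qquad j = 1, 2, \ldots, k.$$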



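The general distributions which satisfy (11) include the standard normal distribution (in this case, s = 3) and the "sparse projection" distribution specified as

$$r_{ij} = \sqrt{s} \times \begin{cases} 1 & \text{with prob. } \frac{1}{2s} \\ 0 & \text{with prob. } 1 - \frac{1}{s} \\ -1 & \text{with prob. } \frac{1}{2s} \end{cases} \tag{12}$$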



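[24] provided the following unbiased estimator $\hat{a}_{rp,s}$ of a and the general variance formula:

$$\hat{a}_{rp,s} = \frac{1}{k} \sum_{j=1}^{k} v_{1,j} v_{2,j}, \qquad E(\hat{a}_{rp,s}) = a = \sum_{i=1}^{D} u_{1,i} u_{2,i}, \tag{13}$$

$$\mathrm{Var}(\hat{a}_{rp,s}) = \frac{1}{k} \left[ \sum_{i=1}^{D} u_{1,i}^2 \sum_{i=1}^{D} u_{2,i}^2 + \left( \sum_{i=1}^{D} u_{1,i} u_{2,i} \right)^2 + (s - 3) \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 \right], \tag{14}$$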



which means s = 1 achieves the smallest variance. The only elementary distribution we know that satisfies (11) with s = 1 is the two-point distribution in {−1, 1} with equal probabilities, i.e., (12) with s = 1.
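For tiny vectors, unbiasedness and the variance formula (14) can be verified exactly in the case s = 1 and k = 1 by enumerating every sign vector in $\{-1, 1\}^D$. The sketch below does this with exact rational arithmetic; the function names are hypothetical and integer-valued inputs are assumed.

```python
from fractions import Fraction
from itertools import product

def rp_moments_exhaustive(u1, u2):
    """Exact mean and variance of v1 * v2 (k = 1) over all sign vectors r in {-1, 1}^D."""
    D = len(u1)
    vals = []
    for r in product((-1, 1), repeat=D):
        v1 = sum(a * s for a, s in zip(u1, r))
        v2 = sum(a * s for a, s in zip(u2, r))
        vals.append(Fraction(v1 * v2))
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var

def rp_variance_formula(u1, u2, k, s):
    """Variance (14) of the random projection estimator for integer-valued inputs."""
    a = sum(x * y for x, y in zip(u1, u2))
    m1 = sum(x * x for x in u1)
    m2 = sum(y * y for y in u2)
    m12 = sum(x * x * y * y for x, y in zip(u1, u2))
    return Fraction(m1 * m2 + a * a + (s - 3) * m12, k)

if __name__ == "__main__":
    u1, u2 = [1, 2], [3, 1]
    print(rp_moments_exhaustive(u1, u2))          # mean equals the inner product 5
    print(rp_variance_formula(u1, u2, k=1, s=1))  # matches the exhaustive variance
```

For general k the estimator averages k independent copies, so the variance is the k = 1 value divided by k, as in (14).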



6.2 Vowpal Wabbit (VW) 

Again, in this paper, "VW" always refers to the particular algorithm in [34]. VW may be viewed as a "bias-corrected" version of the Count-Min (CM) sketch algorithm [12]. In the original CM algorithm, the key step is to independently and uniformly hash elements of the data vectors to buckets $\in \{1, 2, 3, \ldots, k\}$ and the hashed value is the sum of the elements in the bucket. That is, $h(i) = j$ with probability $\frac{1}{k}$, where $j \in \{1, 2, \ldots, k\}$. For convenience, we introduce an indicator function:

$$I_{ij} = \begin{cases} 1 & \text{if } h(i) = j \\ 0 & \text{otherwise} \end{cases}$$

which allows us to write the hashed data as

$$w_{1,j} = \sum_{i=1}^{D} u_{1,i} I_{ij}, \qquad w_{2,j} = \sum_{i=1}^{D} u_{2,i} I_{ij}.$$
i=l i=l 
The estimate $\hat{a}_{cm} = \sum_{j=1}^{k} w_{1,j} w_{2,j}$ is (severely) biased for the task of estimating inner products. The original paper [12] suggested a "count-min" step for positive data, by generating multiple independent estimates $\hat{a}_{cm}$ and taking the minimum as the final estimate. That step cannot remove the bias and makes the analysis (such as variance) very difficult. Here we should mention that the bias of CM may not be a major issue in other tasks such as sparse recovery (or "heavy-hitter", or "elephant detection", as named by various communities).



[34] proposed a creative method for bias-correction, which consists of pre-multiplying (element-wise) the original data vectors with a random vector whose entries are sampled i.i.d. from the two-point distribution in {−1, 1} with equal probabilities, which corresponds to s = 1 in (12).

Here, we consider a more general situation by considering any $s \geq 1$. After applying multiplication and hashing on $u_1$ and $u_2$ as in [34], the resultant vectors $g_1$ and $g_2$ are

$$g_{1,j} = \sum_{i=1}^{D} u_{1,i} r_i I_{ij}, \qquad g_{2,j} = \sum_{i=1}^{D} u_{2,i} r_i I_{ij}, \qquad j = 1, 2, \ldots, k, \tag{15}$$

where $r_i$ is defined as in (11), i.e., $E(r_i) = 0$, $E(r_i^2) = 1$, $E(r_i^3) = 0$, $E(r_i^4) = s$. We have the following Lemma.
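Lemma 1

$$\hat{a}_{vw,s} = \sum_{j=1}^{k} g_{1,j} g_{2,j}, \qquad E(\hat{a}_{vw,s}) = \sum_{i=1}^{D} u_{1,i} u_{2,i} = a, \tag{16}$$

$$\mathrm{Var}(\hat{a}_{vw,s}) = (s - 1) \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 + \frac{1}{k} \left[ \sum_{i=1}^{D} u_{1,i}^2 \sum_{i=1}^{D} u_{2,i}^2 + \left( \sum_{i=1}^{D} u_{1,i} u_{2,i} \right)^2 - 2 \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 \right]. \tag{17}$$

Proof: See Appendix B. □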



Interestingly, the variance (17) says we do need s = 1, otherwise the additional term $(s - 1) \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2$ will not vanish even as the sample size $k \to \infty$. In other words, the choice of random distribution in VW is essentially the only option if we want to remove the bias by pre-multiplying the data vectors (element-wise) with a vector of random variables. Of course, once we let s = 1, the variance (17) becomes identical to the variance of random projections (14).
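The statements of Lemma 1 can be checked exactly on a toy example with s = 1 by enumerating every bucket assignment $h$ and every sign vector $r$. This is only a verification sketch with hypothetical function names and integer-valued inputs; it is not an implementation of the VW software.

```python
from fractions import Fraction
from itertools import product

def vw_moments_exhaustive(u1, u2, k):
    """Exact mean and variance of the VW estimate (s = 1) over all hashes and signs."""
    D = len(u1)
    vals = []
    for h in product(range(k), repeat=D):         # all bucket assignments h(i)
        for r in product((-1, 1), repeat=D):      # all sign vectors r_i
            g1 = [0] * k
            g2 = [0] * k
            for i in range(D):
                g1[h[i]] += u1[i] * r[i]
                g2[h[i]] += u2[i] * r[i]
            vals.append(Fraction(sum(p * q for p, q in zip(g1, g2))))
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var

def vw_variance_formula(u1, u2, k, s=1):
    """Variance (17) for integer-valued inputs."""
    a = sum(x * y for x, y in zip(u1, u2))
    m1 = sum(x * x for x in u1)
    m2 = sum(y * y for y in u2)
    m12 = sum(x * x * y * y for x, y in zip(u1, u2))
    return (s - 1) * m12 + Fraction(m1 * m2 + a * a - 2 * m12, k)

if __name__ == "__main__":
    u1, u2 = [1, 2, 1], [3, 1, 0]
    print(vw_moments_exhaustive(u1, u2, k=2))   # mean is the inner product 5
    print(vw_variance_formula(u1, u2, k=2))
```

With s = 1, the value returned by `vw_variance_formula` coincides with the random projection variance (14) at the same k, as the text observes.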



7 Comparing b-Bit Minwise Hashing with VW

We implemented VW (which, in this paper, always refers to the algorithm developed in [34]) and tested it on the same webspam dataset. Figure 8 shows that b-bit minwise hashing is substantially more accurate (at the same sample size k) and requires significantly less training time (to achieve the same accuracy). For example, 8-bit minwise hashing with k = 200 achieves about the same test accuracy as VW with $k = 10^6$. Note that we only stored the non-zeros of the hashed data generated by VW.




[Figure 8 plot panels omitted; horizontal axis: k.]



Figure 8: The dashed (red if color is available) curves represent b-bit minwise hashing results (only for $k \leq 500$) while solid curves represent VW. We display results for C = 0.01, 0.1, 1, 10, 100.

This empirical finding is not surprising, because the variance of b-bit hashing is usually substantially smaller than the variance of VW (and random projections). In Appendix C, we show that, at the same storage cost, b-bit hashing usually improves VW by 10- to 100-fold, by assuming each sample of VW requires 32 bits of storage. Of course, even if VW only stores each sample using 16 bits, an improvement of 5- to 50-fold would still be very substantial. Note that this comparison makes sense for the purpose of data reduction, i.e., the sample size k is substantially smaller than the number of non-zeros in the original (massive) data.

There is one interesting issue here. Unlike random projections (and minwise hashing), VW is a sparsity-preserving algorithm, meaning that in the resultant sample vector of length k, the number of non-zeros will not exceed the number of non-zeros in the original vector. In fact, it is easy to see that the fraction of zeros in the resultant vector would be (at least) $\left(1 - \frac{1}{k}\right)^c \approx \exp\left(-\frac{c}{k}\right)$, where c is the number of non-zeros in the original data vector. When $\frac{c}{k} > 5$, then $\exp\left(-\frac{c}{k}\right) \approx 0$. In other words, if our goal is data reduction (i.e., $k \ll c$), then the hashed data by VW are dense.
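The density claim is easy to check numerically. The sketch below compares the exact expression $(1 - 1/k)^c$ for the expected fraction of empty buckets with its approximation $\exp(-c/k)$; the function names and the sample values of c and k are illustrative assumptions.

```python
import math

def zero_fraction_exact(c, k):
    """Probability that a given bucket receives none of the c non-zeros: (1 - 1/k)^c."""
    return (1 - 1 / k) ** c

def zero_fraction_approx(c, k):
    """Approximation exp(-c / k) of the same quantity."""
    return math.exp(-c / k)

if __name__ == "__main__":
    for c, k in [(5000, 1000), (1000, 1000), (10, 1000)]:
        print(c, k, zero_fraction_exact(c, k), zero_fraction_approx(c, k))
```

With c/k = 5 fewer than 1% of the buckets stay empty (dense output), whereas with $c \ll k$ almost all buckets stay empty (sparse output).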

In this paper, we mainly focus on data reduction. As discussed in the introduction, for many industry applications, the relatively sparse datasets are often massive in absolute scale and we assume we cannot store all the non-zeros. In fact, this is also one of the basic motivations for developing minwise hashing.

However, the case of $c \ll k$ can also be interesting and useful in our work. This is because VW is an excellent tool for achieving compact indexing due to the sparsity-preserving property. Basically, we can let k be very large (like $2^{26}$ in [34]). As the original dictionary size D is extremely large (e.g., $2^{64}$), even $k = 2^{26}$ will be a meaningful reduction of the indexing. Of course, using a very large k will not be useful for the purpose of data reduction.



8 Combining b-Bit Minwise Hashing with VW 

In our algorithm, we reduce the original massive data to nbk bits only, where n is the number of data points. With (e.g.) k = 200 and b = 8, our technique achieves a huge data reduction. At run-time, we need to expand each data point into a binary vector of length $2^b k$ with exactly k 1's. If b is large, like 16, the new binary vectors will be highly sparse. In fact, in Figure 3 and Figure 7, we can see that when using b = 16, the training time becomes substantially larger than when using $b \leq 8$ (especially when k is large).

On the other hand, once we have expanded the vectors, the task is merely computing inner products, for which we can actually use VW. Therefore, at run-time, after we have generated the sparse binary vectors of length $2^b k$, we hash them using VW with sample size m (to differentiate from k). How large should m be? Lemma 2 may provide some insights.

Recall that Section 2 provides the estimator, denoted by $\hat{R}_b$, of the resemblance R, using b-bit minwise hashing. Now, suppose we first apply VW hashing with size m on the binary vector of length $2^b k$ before estimating R, which will introduce some additional randomness (on top of b-bit hashing). We denote the new estimator by $\hat{R}_{b,vw}$. Lemma 2 provides its theoretical variance.

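Lemma 2

$$E\left(\hat{R}_{b,vw}\right) = R, \tag{18}$$

$$\mathrm{Var}\left(\hat{R}_{b,vw}\right) = \mathrm{Var}\left(\hat{R}_b\right) + \frac{1}{m} \frac{1}{[1 - C_{2,b}]^2} \left( 1 + P_b^2 - \frac{P_b (1 + P_b)}{k} \right) \tag{19}$$

$$= \frac{1}{k} \frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2} + \frac{1}{m} \frac{1 + P_b^2}{[1 - C_{2,b}]^2} - \frac{1}{mk} \frac{P_b (1 + P_b)}{[1 - C_{2,b}]^2},$$

where $\mathrm{Var}\left(\hat{R}_b\right) = \frac{1}{k} \frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2}$ is given by (6) and $C_{2,b}$ is the constant defined in Theorem 1.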

Proof: The proof is quite straightforward, by following the conditional expectation formula $E(X) = E(E(X|Y))$ and the conditional variance formula $\mathrm{Var}(X) = E(\mathrm{Var}(X|Y)) + \mathrm{Var}(E(X|Y))$.

Recall that originally we estimate the resemblance by $\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}$, where $\hat{P}_b = \frac{T}{k}$ and T is the number of matches in the two hashed data vectors of length k generated by b-bit hashing. $E(T) = k P_b$ and $\mathrm{Var}(T) = k P_b (1 - P_b)$. Now, we apply VW (of size m and s = 1) on the hashed data vectors to estimate T (instead of counting it exactly). We denote this estimate by $\hat{T}$ and $\hat{P}_{b,vw} = \frac{\hat{T}}{k}$.

Because we know the VW estimate is unbiased, we have

$$E\left(\hat{P}_{b,vw}\right) = E\left(\frac{\hat{T}}{k}\right) = \frac{E(T)}{k} = \frac{k P_b}{k} = P_b.$$

Using the conditional variance formula and the variance of VW (17) (with s = 1), we obtain

$$\mathrm{Var}\left(\hat{P}_{b,vw}\right) = \frac{1}{k^2} \left[ E\left( \frac{1}{m} \left( k^2 + T^2 - 2T \right) \right) + \mathrm{Var}(T) \right]$$

$$= \frac{1}{k^2} \left[ \frac{1}{m} \left( k^2 + k P_b (1 - P_b) + k^2 P_b^2 - 2 k P_b \right) + k P_b (1 - P_b) \right]$$

$$= \frac{P_b (1 - P_b)}{k} + \frac{1}{m} \left( 1 + P_b^2 - \frac{P_b (1 + P_b)}{k} \right).$$

This completes the proof. □



Compared to the original variance $\mathrm{Var}\left(\hat{R}_b\right) = \frac{1}{k} \frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2}$, the additional term $\frac{1}{m} \frac{1 + P_b^2}{[1 - C_{2,b}]^2}$ in (19) can be relatively large if m is not large enough. Therefore, we should choose $m \gg k$ (to reduce the additional variance) and $m \ll 2^b k$ (otherwise there is no need to apply this VW step). If b = 16, then $m = 2^8 k$ may be a good trade-off, because $k \ll 2^8 k \ll 2^{16} k$.
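The trade-off in choosing m can be illustrated by evaluating the variance in Lemma 2 for several values of m. The sketch below assumes the form $\mathrm{Var}(\hat{R}_b) + \frac{1}{m}\frac{1}{[1-C_{2,b}]^2}\left(1 + P_b^2 - \frac{P_b(1+P_b)}{k}\right)$ for (19); the values $P_b = 0.5$ and $C_{2,b} = 2^{-16}$ are illustrative assumptions, and the function names are hypothetical.

```python
def var_Rb(Pb, C2, k):
    """Variance of the b-bit estimator: P_b (1 - P_b) / (k [1 - C2]^2)."""
    return Pb * (1 - Pb) / (k * (1 - C2) ** 2)

def var_Rb_vw(Pb, C2, k, m):
    """Variance after an additional VW step of size m, following Lemma 2."""
    extra = (1 + Pb ** 2 - Pb * (1 + Pb) / k) / (m * (1 - C2) ** 2)
    return var_Rb(Pb, C2, k) + extra

if __name__ == "__main__":
    Pb, C2, k = 0.5, 2.0 ** -16, 200      # illustrative values for b = 16
    base = var_Rb(Pb, C2, k)
    for exponent in (0, 1, 2, 3, 8):
        m = (2 ** exponent) * k
        ratio = var_Rb_vw(Pb, C2, k, m) / base
        print(f"m = 2^{exponent} k: variance inflated by a factor of {ratio:.3f}")
```

Under these assumed values, m = k inflates the variance several-fold, while $m = 2^8 k$ adds only a few percent, which is in line with the suggested choice $k \ll m \ll 2^b k$.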

Figure 9 provides an empirical study to verify this intuition. Basically, with $m = 2^8 k$, using VW on top of 16-bit hashing achieves the same accuracies as using 16-bit hashing directly and reduces the training time quite noticeably.




[Figure 9 plot panels omitted; horizontal axis: C.]



Figure 9: We apply VW hashing on top of the binary vectors (of length $2^b k$) generated by b-bit hashing, with size $m = 2^0 k, 2^1 k, 2^2 k, 2^3 k, 2^8 k$, for k = 200 and b = 16. The numbers on the solid curves (0, 1, 2, 3, 8) are the exponents. The dashed (red if color is available) curves are the results from only using b-bit hashing. When $m = 2^8 k$, this method achieves the same test accuracies (left two panels) while considerably reducing the training time (right two panels), if we focus on $C \approx 1$.

We also experimented with combining 8-bit hashing with VW. We found that we need $m = 2^8 k$ to achieve similar accuracies, i.e., the additional VW step did not bring more improvement (without hurting accuracies) in terms of training speed when b = 8. This is understandable from the analysis of the variance in Lemma 2.



9 Practical Considerations 

Minwise hashing has been widely used in (search) industry and b-bit minwise hashing requires only very minimal (if
any) modifications (by doing less work). Thus, we expect b-bit minwise hashing will be adopted in practice. It is also
well-understood in practice that we can use (good) hashing functions to very efficiently simulate permutations.
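As an illustration of simulating permutations with hash functions, the sketch below computes a b-bit minwise signature. It is a minimal example under stated assumptions: the hash family h(x) = (mult * x + shift) mod p with the Mersenne prime p = 2^61 - 1 is a common universal-hashing choice and is not necessarily what was used in the experiments, and all function names are hypothetical.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus of the assumed hash family


def make_hash_functions(k, seed=0):
    """Draw k (multiplier, shift) pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
        for _ in range(k)
    ]


def b_bit_minhash(feature_ids, hash_functions, b):
    """Keep the lowest b bits of the minimum hashed value, per hash function."""
    mask = (1 << b) - 1
    signature = []
    for mult, shift in hash_functions:
        smallest = min((mult * x + shift) % MERSENNE_PRIME for x in feature_ids)
        signature.append(smallest & mask)
    return signature


def match_fraction(sig1, sig2):
    """Fraction of positions where two signatures agree (an estimate of P_b)."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


if __name__ == "__main__":
    hashes = make_hash_functions(k=200, seed=1)
    s1 = b_bit_minhash({1, 5, 9, 42, 77}, hashes, b=8)
    s2 = b_bit_minhash({1, 5, 9, 42, 99}, hashes, b=8)
    print(match_fraction(s1, s2))
```

Each set needs only one pass over its non-zero feature indices, which is why this preprocessing can be done in a single scan of the data.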

In many real-world scenarios, the preprocessing step is not critical because it requires only one scan of the data,
which can be conducted off-line (or at the data-collection stage, or at the same time as n-grams are generated), and
it is trivially parallelizable. In fact, because b-bit minwise hashing can substantially reduce the memory consumption,
it may now be affordable to store considerably more examples in memory (after b-bit hashing) than before, to
avoid (or minimize) disk I/Os. Once the hashed data have been generated, they can be used and re-used for many
tasks such as supervised learning, clustering, duplicate detection, near-neighbor search, etc. For example, a learning
task may need to re-use the same (hashed) dataset to perform many cross-validations and parameter tuning (e.g., for
experimenting with many C values in SVM).

Nevertheless, there might be situations in which the preprocessing time can be an issue. For example, when a new
unprocessed document (i.e., n-grams are not yet available) arrives and a particular application requires an immediate
response from the learning algorithm, then the preprocessing cost might (or might not) be an issue. Firstly, generating
n-grams will take some time. Secondly, if during the session a disk I/O occurs, then the I/O cost will typically mask the
cost of preprocessing for b-bit minwise hashing.

Note that the preprocessing cost for the VW algorithm can be substantially lower. Thus, if the time for
preprocessing is indeed a concern (while the storage cost or test accuracies are not as much), one may want to consider
using VW (or very sparse random projections [24]) for those applications.

10 Conclusion 

As data sizes continue to grow faster than memory and computational power, machine-learning tasks in industrial
practice are increasingly faced with training datasets that exceed the resources of a single server. A number of
approaches have been proposed that address this by either scaling out the training process or partitioning the data,
but both solutions can be expensive.

In this paper, we propose a compact representation of sparse, binary datasets based on b-bit minwise hashing.
We show that the b-bit minwise hashing estimators are positive definite kernels and can be naturally integrated with
learning algorithms such as SVM and logistic regression, leading to dramatic improvements in training time and/or
resource requirements. We also compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm, which has
the same variances as random projections. Our theoretical and empirical comparisons illustrate that usually b-bit
minwise hashing is significantly more accurate (at the same storage) than VW for binary data. Interestingly, b-bit
minwise hashing can be combined with VW to achieve further improvements in terms of training speed when b is
large (e.g., $b \geq 16$).

A Approximation Errors of the Basic Probability Formula



Note that the only assumption needed in the proof of Theorem 1 is that D is large, which is virtually always satisfied in
practice. Interestingly, (4) is remarkably accurate even for very small D. Figure 10 shows that when D = 20/200/500,
the absolute error caused by using (4) is < 0.01/0.001/0.0004. The exact probability, which has no closed form, can
be computed by exhaustive enumerations for small D.

Figure 10: The absolute errors (approximate - exact) by using (4) are very small even for D = 20 (left panels),
D = 200 (middle panels), and D = 500 (right panels). The exact probability can be numerically computed for small
D (from a probability matrix of size D x D). For each D, we selected three $f_1$ values. We always let $f_2 = 2, 3, \ldots, f_1$
and $a = 0, 1, 2, \ldots, f_2$.
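For very small D, the exact matching probability can also be obtained by brute force over all D! permutations, which gives a simple way to inspect the approximation error. The sketch below is illustrative only. It assumes the basic probability formula has the form $P_b = C_{1,b} + (1 - C_{2,b})R$ with $r_j = f_j/D$ and $A_{j,b} = r_j[1-r_j]^{2^b-1} / (1 - [1-r_j]^{2^b})$, as recalled from Theorem 1 (which is stated outside this appendix), and that the permuted values lie in {0, 1, ..., D-1}. The function names are hypothetical, and direct enumeration is a simpler technique than the D x D probability matrix mentioned in the caption.

```python
from itertools import permutations


def exact_match_probability(s1, s2, D, b):
    """Pr(lowest b bits of min pi(S1) equal those of min pi(S2)), over all D! permutations."""
    mask = (1 << b) - 1
    hits = total = 0
    for perm in permutations(range(D)):
        z1 = min(perm[i] for i in s1)
        z2 = min(perm[i] for i in s2)
        hits += (z1 & mask) == (z2 & mask)
        total += 1
    return hits / total


def approx_match_probability(f1, f2, a, D, b):
    """Assumed form of the basic formula: C_{1,b} + (1 - C_{2,b}) * R."""
    r1, r2 = f1 / D, f2 / D
    R = a / (f1 + f2 - a)

    def A(r):
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

    c1 = A(r1) * r2 / (r1 + r2) + A(r2) * r1 / (r1 + r2)
    c2 = A(r1) * r1 / (r1 + r2) + A(r2) * r2 / (r1 + r2)
    return c1 + (1 - c2) * R


if __name__ == "__main__":
    D, b = 7, 1                      # D! = 5040 permutations
    s1, s2 = [0, 1, 2, 3], [2, 3, 4]  # f1 = 4, f2 = 3, a = 2
    print(exact_match_probability(s1, s2, D, b))
    print(approx_match_probability(4, 3, 2, D, b))
```

Enumeration grows as D!, so this is only practical for D up to roughly 9 or 10; it is meant for spot checks, not for reproducing Figure 10.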



B Proof of Lemma 1

The VW algorithm [34] provides a bias-corrected version of the Count-Min (CM) sketch algorithm [12].



B.1 The Analysis of the CM Algorithm

The key step in CM is to independently and uniformly hash elements of the data vectors to $\{1, 2, 3, \ldots, k\}$. That is,
$h(i) = q$ with equal probabilities, where $q \in \{1, 2, \ldots, k\}$. For convenience, we introduce the following indicator
function:

$$
I_{iq} = \begin{cases} 1 & \text{if } h(i) = q \\ 0 & \text{otherwise} \end{cases}
$$

Thus, we can denote the CM "samples" after the hashing step by

$$
w_{1,q} = \sum_{i=1}^{D} u_{1,i} I_{iq}, \qquad w_{2,q} = \sum_{i=1}^{D} u_{2,i} I_{iq},
$$

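and estimate the inner product by $\hat{a}_{cm} = \sum_{q=1}^{k} w_{1,q} w_{2,q}$, whose expectation and variance can be shown to be

$$
E(\hat{a}_{cm}) = \sum_{i=1}^{D} u_{1,i}u_{2,i} + \frac{1}{k}\sum_{i \neq j} u_{1,i}u_{2,j}
= a + \frac{1}{k}\sum_{i=1}^{D} u_{1,i} \sum_{i=1}^{D} u_{2,i} - \frac{1}{k}a \tag{20}
$$

$$
\mathrm{Var}(\hat{a}_{cm}) = \frac{1}{k}\left(1 - \frac{1}{k}\right)\left[\sum_{i=1}^{D} u_{1,i}^2 \sum_{i=1}^{D} u_{2,i}^2 + \left(\sum_{i=1}^{D} u_{1,i}u_{2,i}\right)^2 - 2\sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2\right] \tag{21}
$$

From the definition of $I_{iq}$, we can easily infer its moments, for example,

$$
I_{iq} = \begin{cases} 1 & \text{if } h(i) = q \\ 0 & \text{otherwise} \end{cases}, \qquad
E(I_{iq}^n) = \frac{1}{k}, \qquad
E(I_{iq} I_{iq'}) = 0 \ \text{if } q \neq q', \qquad
E(I_{iq} I_{i'q'}) = \frac{1}{k^2} \ \text{if } i \neq i'.
$$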
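The mean (20) and variance (21) can be verified exactly on a toy example by enumerating all k^D equally likely hash assignments. The sketch below is only a sanity check with hypothetical function names; buckets are indexed 0, ..., k-1 rather than 1, ..., k, which does not affect the estimator.

```python
from itertools import product


def cm_estimate(u1, u2, h, k):
    """Inner product of the two CM samples for one hash assignment h."""
    w1 = [0.0] * k
    w2 = [0.0] * k
    for i, q in enumerate(h):
        w1[q] += u1[i]
        w2[q] += u2[i]
    return sum(x * y for x, y in zip(w1, w2))


def cm_exact_moments(u1, u2, k):
    """Mean and variance over all k**D hash assignments (uniform weights)."""
    values = [cm_estimate(u1, u2, h, k) for h in product(range(k), repeat=len(u1))]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def cm_formula_moments(u1, u2, k):
    """Closed forms (20) and (21)."""
    a = sum(x * y for x, y in zip(u1, u2))
    s1, s2 = sum(u1), sum(u2)
    q1, q2 = sum(x * x for x in u1), sum(y * y for y in u2)
    q12 = sum(x * x * y * y for x, y in zip(u1, u2))
    mean = a + (s1 * s2 - a) / k
    var = (1 / k) * (1 - 1 / k) * (q1 * q2 + a * a - 2 * q12)
    return mean, var


if __name__ == "__main__":
    u1, u2 = [1, 0, 2, 1], [0, 3, 1, 1]  # arbitrary toy vectors
    print(cm_exact_moments(u1, u2, k=3))
    print(cm_formula_moments(u1, u2, k=3))
```

The enumerated mean exceeds the true inner product a whenever the cross terms are positive, which is the bias that VW removes.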



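The proof of the mean (20) is simple:

$$
E(\hat{a}_{cm}) = \sum_{q=1}^{k}\left[\sum_{i=1}^{D} u_{1,i}u_{2,i} E\left(I_{iq}^2\right) + \sum_{i \neq j} u_{1,i}u_{2,j} E\left(I_{iq}I_{jq}\right)\right]
= \sum_{i=1}^{D} u_{1,i}u_{2,i} + \frac{1}{k}\sum_{i \neq j} u_{1,i}u_{2,j}.
$$

The variance (21) is more complicated:

$$
\mathrm{Var}(\hat{a}_{cm}) = E(\hat{a}_{cm}^2) - E^2(\hat{a}_{cm})
= \sum_{q=1}^{k} E\left(w_{1,q}^2 w_{2,q}^2\right) + \sum_{q \neq q'} E\left(w_{1,q}w_{2,q}w_{1,q'}w_{2,q'}\right)
- \left(\sum_{i=1}^{D} u_{1,i}u_{2,i} + \frac{1}{k}\sum_{i \neq j} u_{1,i}u_{2,j}\right)^2
$$

The following expansions are helpful (indices under a sum such as $i \neq j \neq c$ are mutually distinct):

$$
\left[\sum_{i \neq j} a_i b_j\right]^2 = \sum_{i \neq j} a_i^2 b_j^2 + \sum_{i \neq j} a_i b_i a_j b_j + \sum_{i \neq j \neq c} a_i^2 b_j b_c + \sum_{i \neq j \neq c} b_i^2 a_j a_c + 2\sum_{i \neq j \neq c} a_i b_i a_j b_c + \sum_{i \neq j \neq c \neq t} a_i b_j a_c b_t
$$

$$
\sum_{i} a_i b_i \sum_{i \neq j} a_i b_j = \sum_{i \neq j} a_i^2 b_i b_j + \sum_{i \neq j} b_i^2 a_i a_j + \sum_{i \neq j \neq c} a_i b_i a_j b_c
$$

which, combined with the moments of $I_{iq}$, yield

B.2 The Analysis of the More General Version of the VW Algorithm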



The nice approach proposed in the VW paper [34] is to pre-element-wise-multiply the data vectors with a random
vector r before taking the hashing operation. We denote the two resultant vectors (samples) by $g_1$ and $g_2$ respectively:

$$
g_{1,q} = \sum_{i=1}^{D} u_{1,i} r_i I_{iq}, \qquad g_{2,q} = \sum_{i=1}^{D} u_{2,i} r_i I_{iq}
$$

where $r_i \in \{-1, 1\}$ with equal probabilities. Here, we provide a more general scheme by sampling $r_i$ from a
sub-Gaussian distribution with parameter s and

$$
E(r_i) = 0, \quad E(r_i^2) = 1, \quad E(r_i^3) = 0, \quad E(r_i^4) = s
$$

which include normal (i.e., s = 3) and the distribution on $\{-1, 1\}$ with equal probabilities (i.e., s = 1) as special
cases.

Let $\hat{a}_{vw,s} = \sum_{q=1}^{k} g_{1,q} g_{2,q}$. The goal is to show

We can use the previous results and the conditional expectation and variance formulas:

$$
E(X) = E(E(X|Y)), \qquad \mathrm{Var}(X) = E(\mathrm{Var}(X|Y)) + \mathrm{Var}(E(X|Y))
$$

$$
E(r_i) = 0, \quad E(r_i^2) = 1, \quad E(r_i^3) = 0, \quad E(r_i^4) = s.
$$
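For the special case s = 1, the mean and variance of $\hat{a}_{vw,s}$ can be checked exactly on a toy example by enumerating every hash assignment and every sign vector. The sketch below is a sanity check with hypothetical function names. The variance formula it uses, $(s-1)\sum_i u_{1,i}^2 u_{2,i}^2 + \frac{1}{k}\left[\sum_i u_{1,i}^2 \sum_i u_{2,i}^2 + (\sum_i u_{1,i}u_{2,i})^2 - 2\sum_i u_{1,i}^2 u_{2,i}^2\right]$, is a reconstruction from the definitions above (an assumption, since the displayed target is not legible here); it reduces to the binary-data variance quoted in Appendix C.

```python
from itertools import product


def vw_estimate(u1, u2, h, r, k):
    """Inner product of the two VW samples for one hash assignment h and sign vector r."""
    g1 = [0.0] * k
    g2 = [0.0] * k
    for i, q in enumerate(h):
        g1[q] += u1[i] * r[i]
        g2[q] += u2[i] * r[i]
    return sum(x * y for x, y in zip(g1, g2))


def vw_exact_moments(u1, u2, k):
    """Mean and variance over all hash assignments and all sign vectors in {-1, 1}^D."""
    D = len(u1)
    values = [
        vw_estimate(u1, u2, h, r, k)
        for h in product(range(k), repeat=D)
        for r in product((-1, 1), repeat=D)
    ]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def vw_formula_variance(u1, u2, k, s=1):
    """Reconstructed variance (assumption): (s-1)*Q + (q1*q2 + a^2 - 2Q)/k."""
    a = sum(x * y for x, y in zip(u1, u2))
    q1, q2 = sum(x * x for x in u1), sum(y * y for y in u2)
    q12 = sum(x * x * y * y for x, y in zip(u1, u2))
    return (s - 1) * q12 + (q1 * q2 + a * a - 2 * q12) / k


if __name__ == "__main__":
    u1, u2 = [1, 0, 2, 1], [0, 3, 1, 1]  # arbitrary toy vectors
    print(vw_exact_moments(u1, u2, k=3))
    print(vw_formula_variance(u1, u2, k=3, s=1))
```

The enumeration only covers s = 1 (random signs); checking other values of s would require sampling $r_i$ from a distribution with the stated fourth moment.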



$E(\mathrm{Var}(\hat{a}_{vw,s} \mid r))$

As $\mathrm{Var}(E(\hat{a}_{vw,s} \mid r)) = E(E^2(\hat{a}_{vw,s} \mid r)) - E^2(E(\hat{a}_{vw,s} \mid r))$, we need to compute

Thus,

B.3 Another Simple Scheme for Bias-Correction

By examining the expectation (20), the bias of CM can be easily removed, because

is unbiased with variance

which is essentially the same as the variance of VW.

C Comparing b-Bit Minwise Hashing with VW Random Projections



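We compare VW (and random projections) with b-bit minwise hashing for the task of estimating inner products on
binary data. With binary data, i.e., $u_{1,i}, u_{2,i} \in \{0, 1\}$, we have $f_1 = \sum_{i=1}^{D} u_{1,i}$, $f_2 = \sum_{i=1}^{D} u_{2,i}$, $a = \sum_{i=1}^{D} u_{1,i}u_{2,i}$.
The variance (17) (by using s = 1) becomes

$$
\mathrm{Var}\left(\hat{a}_{vw,s=1}\right) = \frac{f_1 f_2 + a^2 - 2a}{k}
$$

We can compare this variance with the variance of b-bit minwise hashing. Because the variance (6) is for estimating
the resemblance, we need to convert it into the variance for estimating the inner product a using the relation:

$$
a = \frac{R}{1+R}\left(f_1 + f_2\right)
$$

We can estimate a from the estimated R. Since $\frac{\partial a}{\partial R} = \frac{f_1 + f_2}{(1+R)^2}$, the delta method gives

$$
\mathrm{Var}\left(\hat{a}_b\right) = \left[\frac{1}{(1+R)^2}\left(f_1 + f_2\right)\right]^2 \mathrm{Var}\left(\hat{R}_b\right)
$$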

For b-bit minwise hashing, each sample is stored using only b bits. For VW (and random projections), we assume
each sample is stored using 32 bits (instead of 64 bits) for two reasons: (i) for binary data, it would be very unlikely
for the hashed value to be close to $2^{64}$, even when $D = 2^{64}$; (ii) unlike b-bit minwise hashing, which requires exact
bit-matching in the estimation stage, random projections only need to compute the inner products for which it would
suffice to store hashed values as (double precision) real numbers.

Thus, we define the following ratio to compare the two methods. 

$$
G_{vw} = \frac{\mathrm{Var}\left(\hat{a}_{vw,s=1}\right) \times 32}{\mathrm{Var}\left(\hat{a}_b\right) \times b}
$$

If $G_{vw} > 1$, then b-bit minwise hashing is more accurate than binary random projections. Equivalently, when
$G_{vw} > 1$, in order to achieve the same level of accuracy (variance), b-bit minwise hashing needs smaller storage space
than random projections. There are two issues we need to elaborate on:

1. Here, we assume the purpose of using VW is for data reduction. That is, k is small compared to the number of
non-zeros (i.e., $f_1$, $f_2$). We do not consider the case when k is taken to be extremely large for the benefits of
compact indexing without achieving data reduction.

2. Because we assume k is small, we need to represent the sample with enough precision. That is why we assume
each sample of VW is stored using 32 bits. In fact, since the ratio $G_{vw}$ is usually very large (e.g., 10 to 100) by
using 32 bits for each VW sample, it will remain very large (e.g., 5 to 50) even if we only need to store
each VW sample using 16 bits.
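The ratio $G_{vw}$ is straightforward to evaluate numerically. The sketch below is a minimal illustration, not the code used to produce the figures. It assumes the b-bit variance has the form $\mathrm{Var}(\hat{R}_b) = \frac{1}{k}\frac{P_b(1-P_b)}{[1-C_{2,b}]^2}$ with $P_b = C_{1,b} + (1-C_{2,b})R$ and the constants $C_{1,b}$, $C_{2,b}$ as recalled from Theorem 1 (stated outside this appendix); the function names and the example inputs are hypothetical.

```python
def var_vw_binary(f1, f2, a, k):
    """Variance of the VW estimator on binary data with s = 1."""
    return (f1 * f2 + a * a - 2 * a) / k


def var_a_b(f1, f2, a, D, b, k):
    """Variance of the inner-product estimate derived from b-bit minwise hashing."""
    r1, r2 = f1 / D, f2 / D
    R = a / (f1 + f2 - a)

    def A(r):
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

    # Constants of the assumed basic probability formula.
    c1 = A(r1) * r2 / (r1 + r2) + A(r2) * r1 / (r1 + r2)
    c2 = A(r1) * r1 / (r1 + r2) + A(r2) * r2 / (r1 + r2)
    p_b = c1 + (1 - c2) * R
    var_r = p_b * (1 - p_b) / (k * (1 - c2) ** 2)
    # Delta method: a = R / (1 + R) * (f1 + f2).
    return ((f1 + f2) / (1 + R) ** 2) ** 2 * var_r


def g_vw(f1, f2, a, D, b, k=1):
    """Storage-normalized variance ratio; k cancels because both variances scale as 1/k."""
    return var_vw_binary(f1, f2, a, k) * 32 / (var_a_b(f1, f2, a, D, b, k) * b)


if __name__ == "__main__":
    D = 10 ** 6  # hypothetical dimension for this example
    for b in (8, 4, 2, 1):
        print(b, g_vw(f1=100_000, f2=50_000, a=25_000, D=D, b=b))
```

Sweeping $f_2$ and a over their full ranges for a fixed $f_1/D$ gives one curve family per value of b.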

Without loss of generality, we can assume $f_2 \leq f_1$ (hence $a \leq f_2 \leq f_1$). Figures 11 to 14 display the ratios (24)
for b = 8, 4, 2, 1, respectively. In order to achieve high learning accuracies, b-bit minwise hashing requires b = 4 (or
even 8). In each figure, we plot $G_{vw}$ for $f_1/D$ = 0.0001, 0.1, 0.5, 0.9 and full ranges of $f_2$ and a. We can see that
$G_{vw}$ is much larger than one (usually 10 to 100), indicating the very substantial advantage of b-bit minwise hashing
over random projections.

Note that the comparisons are essentially independent of D. This is because in the variance of binary random
projection (24) the $-2a$ term is negligible compared to $a^2$ in binary data as D is very large. To generate the plots, we
used D = 10^ (although practically D should be much larger). 



Conclusion: Our theoretical analysis has illustrated the substantial improvements of b-bit minwise hashing over the
VW algorithm and random projections in binary data, often by 10- to 100-fold. We feel such a large performance
difference should be noted by researchers and practitioners in large-scale machine learning.